Investigate using doc values for logsdb _id #128404
Conversation
Source-only snapshots only include stored fields, not doc values. I also tried adding doc values to source-only snapshots, but snapshots only include the necessary files. I then tried including the dvd/dvm files when the field name is _id, but the doc values could live inside a compound file, and we can't include those. Instead, the fallback is to load _id from _source, but this only works if the _id was provided in the source (and it probably wasn't). Currently this just produces a null _id value. This may be okay, since source-only snapshots usually need to be reindexed before any searching, though not having an _id field could be problematic.
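To make that lookup order concrete, here is a minimal Java sketch under the assumption that Lucene's standard doc values and stored fields APIs are used directly; the class and helper names are hypothetical and not code from this PR.

```java
import java.io.IOException;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.index.StoredFields;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper illustrating the lookup order described above: prefer the
// _id doc values when the segment has them, otherwise fall back to _source.
final class SourceOnlyIdReader {

    static String readId(LeafReader reader, int docId) throws IOException {
        SortedDocValues idValues = reader.getSortedDocValues("_id");
        if (idValues != null && idValues.advanceExact(docId)) {
            return idValues.lookupOrd(idValues.ordValue()).utf8ToString();
        }
        // Source-only snapshots keep stored fields but not doc values, so try
        // _source. If the id was auto-generated it won't be there, and we end
        // up with a null _id, as noted above.
        StoredFields storedFields = reader.storedFields();
        BytesRef source = storedFields.document(docId).getBinaryValue("_source");
        return source == null ? null : extractIdFromSource(source);
    }

    // Hypothetical: parse the JSON in _source and return its "_id" value, if any.
    private static String extractIdFromSource(BytesRef source) {
        return null; // stub; a real version would use an XContent/JSON parser
    }
}
```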
Benchmark results

Test commit: 2f3b488 (this PR). Ran the rally benchmark.

The following table shows the per-dataset size difference, along with the difference in the …

[table omitted]
All but one dataset saw a decrease in on-disk size. This was despite the fact that, surprisingly, for 6 datasets the storage for the …

For comparison, the following shows the total size for each dataset broken out into stored fields, inverted index, and doc values:

[table omitted]
Finally, here is a comparison of the rally test. (Though we shouldn't put too much stock in these results, and there is probably some optimization to be done.)

[table omitted]
@martijnvg Thoughts on this?
Thanks @parkertimmins for researching this!
That is a good saving! In the baseline, how does _id disk usage compare to the other fields? We haven't checked the current disk usage by field in a while.
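One way to gather those per-field numbers is the analyze-index-disk-usage API; below is a rough Java sketch using the low-level REST client, where the index name and connection details are placeholders rather than anything from this benchmark.

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

// Calls the analyze-index-disk-usage API and prints the per-field breakdown
// (stored fields, inverted index, doc values, ...). Index name and host are
// placeholders.
public class DiskUsageByField {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("POST", "/logs-benchmark/_disk_usage");
            request.addParameter("run_expensive_tasks", "true");
            Response response = client.performRequest(request);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```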
I think this is where the difficulty is. We need to avoid get-by-id operations turning into a linear scan. We could think of ways to avoid that in many cases, for example adding metadata (e.g. a timestamp) to the _id (encoded, when returned in _search and other APIs), then using that to reduce the number of docs a linear scan is needed for (in the get API, id query, etc.). However, this is a big task. Additionally, indexing throughput looks within noise (indexing latencies are always very noisy), but I think for use cases where the id isn't auto-generated by ES, the indexing throughput will also be impacted.
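As an illustration only (not something this PR implements), a timestamp prefix on the externally visible _id could look roughly like the sketch below; all names here are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Base64;

// Illustration of the idea above: prepend an encoded @timestamp to the
// externally visible _id, so get/id-query code could narrow the range of
// documents a linear scan would have to visit.
final class TimestampPrefixedId {

    // 8 timestamp bytes encode to a fixed 11 characters of unpadded base64.
    private static final int PREFIX_LENGTH = 11;

    static String encode(long timestampMillis, String rawId) {
        byte[] ts = ByteBuffer.allocate(Long.BYTES).putLong(timestampMillis).array();
        String prefix = Base64.getUrlEncoder().withoutPadding().encodeToString(ts);
        return prefix + rawId;
    }

    static long decodeTimestamp(String encodedId) {
        byte[] ts = Base64.getUrlDecoder().decode(encodedId.substring(0, PREFIX_LENGTH));
        return ByteBuffer.wrap(ts).getLong();
    }

    static String decodeRawId(String encodedId) {
        return encodedId.substring(PREFIX_LENGTH);
    }
}
```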
I don't think the _id field is required to be a stored field. I would double-check this as well.
IIRC with stored fields, all fields are compressed together, so not using stored fields for _id removes a lot of unique values that don't compress well. I think a less controversial change is to just make the _id field not stored, but with doc values (without a skip list), and then keep the inverted index for the _id field. The win should be much smaller, but we should see better stored field compression, like you already observed. I think for now we should table this research/experiment and focus on the pattern text field, which should give more storage savings.
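For reference, a minimal Lucene-level sketch of that variant (inverted index plus plain sorted doc values, no stored field) could look like the following; this is an assumption about how the fields would be wired up, not the actual mapper change.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper showing the combination discussed above: keep the inverted
// index for _id (so term lookups still work), add plain sorted doc values, and
// drop the stored field.
final class IdFieldSketch {

    static void addIdFields(Document doc, String id) {
        BytesRef idBytes = new BytesRef(id);
        // Indexed term for _id; Field.Store.NO means no stored field is written.
        doc.add(new StringField("_id", idBytes, Field.Store.NO));
        // Plain sorted doc values (no skip list), used to read the value back.
        doc.add(new SortedDocValuesField("_id", idBytes));
    }
}
```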
@martijnvg
And below is the per-index and per-field estimate of storage size for the same benchmark:

[table omitted]
Also, here are the per-index/format sizes from the previous baseline I ran off of commit 9db1837. This commit includes the recent changes which removed points from seq_no.

[table omitted]
Remove the _id stored field and inverted index from logsdb, replacing them with a doc value. Investigate the effects of this change on storage size.